Clustering suggestion for Chinese news web pages from multi-media sources

Authors

  • Deng-Yiv Chiu
  • Ya-Chen Pan
Abstract

On Chinese web portals, some news items are obviously classified into incorrect categories. The likely reasons are that Chinese news is difficult to classify automatically and that the news appearing on a portal is retrieved from many media sources. In this study, we integrate a genetic algorithm with a multi-class support vector machine (SVM) classifier to construct a Chinese news classification method. In addition, we find that some similar documents are scattered across different categories, probably because the categories used by the original media sources differ from those of the news portal. Such similar news items should be gathered to form a new category. We therefore combine a genetic algorithm with the fuzzy c-means algorithm to propose a new approach that offers clustering suggestions for news web pages that are scattered across different categories and come from multiple media sources.
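The abstract names fuzzy c-means as the clustering component but gives no implementation details. As a rough illustration only, the following is a minimal sketch of the textbook fuzzy c-means update in plain Python. It is not the authors' method: the genetic-algorithm part is omitted entirely, and the function name `fuzzy_c_means`, its parameters, and the toy two-dimensional "document vectors" are all assumptions made for this example.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means on a list of equal-length numeric vectors.

    Hypothetical helper for illustration. Returns (centers, memberships),
    where memberships[i][k] is the degree to which point i belongs to
    cluster k and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalised so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Centers are membership-weighted means, with weights u^m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])

        # Memberships are proportional to distance^(-2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        # Stop once no membership changes by more than tol.
        shift = max(abs(a - b)
                    for old_row, new_row in zip(u, new_u)
                    for a, b in zip(old_row, new_row))
        u = new_u
        if shift < tol:
            break

    return centers, u


# Toy "document vectors" (assumed data): two items near each axis.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers, memberships = fuzzy_c_means(docs, n_clusters=2)
for doc, row in zip(docs, memberships):
    print(doc, [round(v, 3) for v in row])
```

Because every document receives a graded membership in each cluster rather than a single hard label, items that sit in different portal categories but have high membership in the same cluster could be flagged as candidates for a new shared category, which is the kind of suggestion the abstract describes.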


Similar resources

Distribution of news information through social bookmarking: an examination of shared stories in the Delicious Website

Introduction. This study examined the selection and sharing of news stories from Delicious, a popular social bookmarking site, in order to identify the most frequently consulted news information sources and news topics. Method. Targeting US-specific sources through initial computer screening of URLs, we employed content analysis to further analyse story topics and sources that were unclassified...


Semantically Enhanced Television News through Web and Video Integration

The Rich News system for semantically annotating television news broadcasts and augmenting them with additional web content is described. On-line news sources were mined for material reporting the same stories as those found in television broadcasts, and the text of these pages was semantically annotated using the KIM knowledge management platform. This resulted in more effective indexing than ...


A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web mining faces serious problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages contributes significantly to performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...


Clustering for Web Information Hierarchy Mining

Owing to the growth of dynamic page generation techniques, the number and complexity of Web pages are increasing explosively. Web pages that are dynamically generated from the same templates therefore have similar structures and are usually assembled from a set of fundamental information clusters. These neighboring information clusters usually represent the similar semantics...


A Platform for Multilingual News Summarization

We have developed a multilingual version of Columbia Newsblaster as a testbed for multilingual multi-document summarization. The system collects, clusters, and summarizes news documents from sources all over the world daily. It crawls news sites in many different countries, written in different languages, extracts the news text from the HTML pages, uses a variety of methods to translate the doc...




Publication date: 2011